Hiring Diagnostics

Introducing Multiple Regression

Robert W. Walker

Multiple Regression

The Problem

The HR Department needs a diagnostic, built from the GamingDevices data, for choosing among applicants for employment. The goal is to maximize the productivity of future hires by building and deploying a model that predicts and explains worker productivity. We are to explain With as a function, potentially, of:

  • Dexterity: a diagnostic based on manual dexterity
  • General: a diagnostic based on general aptitude
  • Without: measured productivity over a four-week period prior to introducing the device.
  • With: measured productivity over a four-week period with the handheld device.

Summary Statistics

Explore

What can we say?

With the device?

result <- single_mean(GamingDevices, var = "With")
summary(result)
Single mean test
Data      : GamingDevices 
Variable  : With 
Confidence: 0.95 
Null hyp. : the mean of With = 0 
Alt. hyp. : the mean of With is not equal to 0 

    mean  n n_missing     sd    se    me
 109.000 36         0 12.247 2.041 4.144

 diff    se t.value p.value df    2.5%   97.5%    
  109 2.041  53.399  < .001 35 104.856 113.144 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Without the device?

result <- single_mean(GamingDevices, var = "Without")
summary(result)
Single mean test
Data      : GamingDevices 
Variable  : Without 
Confidence: 0.95 
Null hyp. : the mean of Without = 0 
Alt. hyp. : the mean of Without is not equal to 0 

    mean  n n_missing     sd    se    me
 104.000 36         0 12.981 2.164 4.392

 diff    se t.value p.value df   2.5%   97.5%    
  104 2.164  48.069  < .001 35 99.608 108.392 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The means could be the same, though we have ignored the pairing.
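radiant's single_mean screen is a one-sample t test; the same numbers come from base R's t.test. A minimal sketch on simulated stand-in values, since the GamingDevices data are not bundled with R:

```r
set.seed(1)                                   # simulated stand-in for GamingDevices$With
with_dev <- rnorm(36, mean = 109, sd = 12.2)
tt <- t.test(with_dev, mu = 0, conf.level = 0.95)
tt$parameter                                  # df = n - 1 = 35, as in the output above
```

The estimate, standard error, t value, p value, and confidence limits in tt line up column for column with the radiant table.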

But That Is Not Really the Question

My.Data <- GamingDevices %>% select(With,Without) %>% stack()
My.Data$Subject <- c(GamingDevices$Subject,GamingDevices$Subject)
ggplot(My.Data) +
 aes(x = Subject, y = values, colour = ind) +
 geom_point(size = 3, alpha=0.75) +
 scale_color_viridis_d() + labs(y="Productivity", color="Device?") +
 theme_minimal()

We Can Be Creative at The Command Line

Diff.G.0 <- (GamingDevices$With - GamingDevices$Without > 0)
ggplot(GamingDevices, aes(x = Subject, ymin = Without, ymax = With)) +
  geom_linerange(aes(color = Diff.G.0)) +
  geom_point(aes(x = Subject, y = With), color = "green", size = 2, alpha = 0.3) +
  geom_point(aes(x = Subject, y = Without), color = "red", size = 2, alpha = 0.3) +
  theme_minimal() +
  labs(color = "With Bigger?")

The Important Thing is Some Good Visual

Learning from Standard Deviations

Explore

A Paired t

So the device works. It generates 3.76 to 6.24 units per period above and beyond Without.

GamingDevices$Diff <- GamingDevices$With - GamingDevices$Without
result <- single_mean(GamingDevices, var = "Diff")
summary(result)
Single mean test
Data      : GamingDevices 
Variable  : Diff 
Confidence: 0.95 
Null hyp. : the mean of Diff = 0 
Alt. hyp. : the mean of Diff is not equal to 0 

  mean  n n_missing    sd    se    me
 5.000 36         0 3.657 0.609 1.237

 diff    se t.value p.value df  2.5% 97.5%    
    5 0.609   8.204  < .001 35 3.763 6.237 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
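The paired t above is literally a one-sample t test on the differences: t.test with paired = TRUE and the single-mean route on With minus Without produce identical statistics. A sketch on simulated stand-in data:

```r
set.seed(42)                                  # stand-ins; GamingDevices is not bundled
with_dev    <- rnorm(36, mean = 109, sd = 12)
without_dev <- with_dev - rnorm(36, mean = 5, sd = 3.7)
paired    <- t.test(with_dev, without_dev, paired = TRUE)
onesample <- t.test(with_dev - without_dev, mu = 0)
all.equal(unname(paired$statistic), unname(onesample$statistic))  # TRUE: same test
```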

95 Percent Lower Bound?

Breaking even requires 12.5 additional units. At the 95 percent lower bound on the gain per period, that means at most 3.15 periods.

result <- single_mean(
  GamingDevices, 
  var = "Diff", 
  alternative = "greater"
)
summary(result)
Single mean test
Data      : GamingDevices 
Variable  : Diff 
Confidence: 0.95 
Null hyp. : the mean of Diff = 0 
Alt. hyp. : the mean of Diff is > 0 

  mean  n n_missing    sd    se    me
 5.000 36         0 3.657 0.609 1.237

 diff    se t.value p.value df   5% 100%    
    5 0.609   8.204  < .001 35 3.97  Inf ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
100/(t.test(GamingDevices$With-GamingDevices$Without, alternative = "greater")$conf.int[[1]]*8)
[1] 3.148384
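Reading the arithmetic in the line above, the implied economics appear to be a device cost of 100 and a value of 8 per unit of productivity; both numbers are assumptions inferred from the code, not stated in the text. The break-even calculation then unpacks as:

```r
cost_of_device <- 100    # assumed: the 100 in the numerator above
value_per_unit <- 8      # assumed: the 8 in the denominator above
units_to_break_even <- cost_of_device / value_per_unit   # 12.5 additional units
lb <- 3.97               # 95% one-sided lower bound on units gained per period
periods_to_break_even <- units_to_break_even / lb
round(periods_to_break_even, 2)                          # about 3.15 periods
```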

The Workflow

Let’s get a feel for the data. How about correlation?

radiant > basics > correlation

First, the question asked and answered, and the p-values.

Nothing here is uncorrelated: for every pair of variables, the p-value for the null hypothesis of no correlation is essentially zero. Everything is correlated with everything else.

Our focus will be the bottom row.

Correlation
Data        : GamingDevices 
Method      : Pearson 
Variables   : General, Dexterity, Without, With 
Null hyp.   : variables x and y are not correlated
Alt. hyp.   : variables x and y are correlated

Correlation matrix:
          General Dexterity Without
Dexterity 0.72                     
Without   0.89    0.53             
With      0.94    0.66      0.96   

p.values:
          General Dexterity Without
Dexterity 0.00                     
Without   0.00    0.00             
With      0.00    0.00      0.00   
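The radiant screen wraps base R's cor and cor.test; a sketch of the same workflow on mtcars, a stand-in since GamingDevices is not bundled:

```r
# the GamingDevices call would be cor(GamingDevices[, c("General", "Dexterity", "Without", "With")])
vars <- mtcars[, c("disp", "hp", "wt", "mpg")]
round(cor(vars), 2)                       # Pearson correlation matrix
cor.test(mtcars$wt, mtcars$mpg)$p.value   # the test behind each p-value entry
```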

Homing In [the Old World]

summary(lm(Without~General, data=GamingDevices))

Call:
lm(formula = Without ~ General, data = GamingDevices)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.0342  -2.8739  -0.6903   3.7688  12.8486 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 65.55449    3.49138   18.78  < 2e-16 ***
General      0.21411    0.01864   11.49 2.99e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.962 on 34 degrees of freedom
Multiple R-squared:  0.7951,    Adjusted R-squared:  0.7891 
F-statistic: 131.9 on 1 and 34 DF,  p-value: 2.99e-13

A Picture

ggplot(GamingDevices, aes(x=General, y=Without)) + geom_point() + geom_smooth(method="lm")

Homing In [the Old World, Part II]

summary(lm(Without~Dexterity, data=GamingDevices))

Call:
lm(formula = Without ~ Dexterity, data = GamingDevices)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.655  -4.390   0.013   3.825  28.814 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 76.99821    7.65028  10.065 9.89e-12 ***
Dexterity    0.15335    0.04214   3.639 0.000899 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.17 on 34 degrees of freedom
Multiple R-squared:  0.2803,    Adjusted R-squared:  0.2591 
F-statistic: 13.24 on 1 and 34 DF,  p-value: 0.000899

A Picture

ggplot(GamingDevices, aes(x=Dexterity, y=Without)) + geom_point() + geom_smooth(method="lm")

Which Does Better?

A few answers. What’s the question?

  • If predictive accuracy is the goal, the residual standard error is larger for Dexterity than for General.

  • If explanatory power is the goal, R^2 is much smaller for Dexterity than for General.

  • The graphics agree: the points sit closer to the line for General [in vertical distance, which is exactly what the two observations above measure].

Now Some F

Let’s look at F. In the examples above, it is a monotone function of R^2; with only one input, F = (n-2) R^2 / (1 - R^2).

How is F built?

It is a ratio of two \chi^2 variables, each divided by its degrees of freedom; equivalently, a ratio of sums of squared normals. Because the numerator and denominator measure the same variable in the same metric, the unknown scale cancels. In the simplest terms, F is the average explained variation over the average unexplained variation. What are we averaging over? Degrees of freedom, from a baseline of n-1.

  • Why? Every fitted line must pass through the point of means, (mean of x, mean of y). That is a restriction on the lines we can draw, and each estimated slope removes one more freely moving part – one more degree of freedom.
Model.Dex <- lm(Without ~ Dexterity, data=GamingDevices)
Model.Gen <- lm(Without ~ General, data=GamingDevices)
anova(Model.Dex)
Analysis of Variance Table

Response: Without
          Df Sum Sq Mean Sq F value   Pr(>F)    
Dexterity  1 1653.2 1653.23  13.242 0.000899 ***
Residuals 34 4244.8  124.85                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Model.Gen)
Analysis of Variance Table

Response: Without
          Df Sum Sq Mean Sq F value   Pr(>F)    
General    1 4689.5  4689.5  131.94 2.99e-13 ***
Residuals 34 1208.5    35.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
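These tables can be checked by hand: F is the mean square for the model over the mean square for the residuals, and with one input it is a function of R^2 alone. A sketch on mtcars, a stand-in since GamingDevices is not bundled:

```r
fit <- lm(mpg ~ wt, data = mtcars)          # stand-in for lm(Without ~ General, ...)
n  <- nrow(mtcars)
r2 <- summary(fit)$r.squared
# one predictor: F = (R^2 / 1) / ((1 - R^2) / (n - 2))
F_by_hand <- r2 / ((1 - r2) / (n - 2))
F_report  <- unname(summary(fit)$fstatistic["value"])
all.equal(F_by_hand, F_report)              # TRUE: the two agree
```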

Some Basic Regressions

Dependent variable:
With
(1) (2) (3)
General 0.219*** 0.095***
(0.019) (0.018)
Dexterity -0.009 0.059***
(0.023) (0.012)
Without 0.797*** 0.551***
(0.041) (0.075)
Constant 71.286*** 15.670*** 34.560***
(2.935) (3.664) (5.148)
Observations 36 36 36
R2 0.888 0.955 0.957
Adjusted R2 0.881 0.952 0.955
Residual Std. Error (df = 33) 4.223 2.682 2.607
F Statistic (df = 2; 33) 130.667*** 348.301*** 369.641***
Note: p<0.1; p<0.05; p<0.01

Residuals

All are plausibly normal

# Dexterity and General
shapiro.test(Mod.DG$residuals)

    Shapiro-Wilk normality test

data:  Mod.DG$residuals
W = 0.96384, p-value = 0.2816
# Dexterity and Without
shapiro.test(Mod.DW$residuals)

    Shapiro-Wilk normality test

data:  Mod.DW$residuals
W = 0.98128, p-value = 0.7876
# General and Without
shapiro.test(Mod.GW$residuals)

    Shapiro-Wilk normality test

data:  Mod.GW$residuals
W = 0.9639, p-value = 0.2828
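The same check in base R, on a bundled dataset since GamingDevices is not shipped:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model
shapiro.test(residuals(fit))              # Shapiro-Wilk test on the residuals
```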

Why Does It Matter?

If the residuals are not normal, then the slope estimates are not t distributed and the ratios of variances are not F distributed. Everything we might want to use the regression for falls apart. For now, at least.

The Plots [Seem Fine]

[Residual diagnostic plots for the three models; the printed indices – 24 and 31, 16 and 13, 16 and 28 – flag potentially outlying observations.]

Formal Evidence

library(gvlma)
gvlma(Mod.DG)

Call:
lm(formula = With ~ General + Dexterity, data = GamingDevices)

Coefficients:
(Intercept)      General    Dexterity  
  71.286437     0.218684    -0.008816  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = Mod.DG) 

                    Value p-value                Decision
Global Stat        7.3824 0.11701 Assumptions acceptable.
Skewness           2.6704 0.10223 Assumptions acceptable.
Kurtosis           0.1628 0.68659 Assumptions acceptable.
Link Function      3.6686 0.05545 Assumptions acceptable.
Heteroscedasticity 0.8806 0.34804 Assumptions acceptable.
gvlma(Mod.DW)

Call:
lm(formula = With ~ Dexterity + Without, data = GamingDevices)

Coefficients:
(Intercept)    Dexterity      Without  
   15.67018      0.05927      0.79705  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = Mod.DW) 

                    Value p-value                Decision
Global Stat        2.8349  0.5858 Assumptions acceptable.
Skewness           0.9045  0.3416 Assumptions acceptable.
Kurtosis           0.7883  0.3746 Assumptions acceptable.
Link Function      1.0035  0.3165 Assumptions acceptable.
Heteroscedasticity 0.1387  0.7096 Assumptions acceptable.
gvlma(Mod.GW)

Call:
lm(formula = With ~ General + Without, data = GamingDevices)

Coefficients:
(Intercept)      General      Without  
   34.55964      0.09543      0.55101  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = Mod.GW) 

                       Value p-value                Decision
Global Stat        3.739e+00  0.4425 Assumptions acceptable.
Skewness           5.974e-04  0.9805 Assumptions acceptable.
Kurtosis           2.699e+00  0.1004 Assumptions acceptable.
Link Function      3.815e-05  0.9951 Assumptions acceptable.
Heteroscedasticity 1.039e+00  0.3080 Assumptions acceptable.

Which to Choose?

  • Without and General provide the best explanation.
  • Without and Dexterity are a close second.
  • General and Dexterity explain almost 90% but trail far behind.

From our chosen model, a few cool things….

A Plot

library(effects)
plot(allEffects(Mod.GW))

A Prediction

Pred.Data <- data.frame(General=119, Without=86)
Pred.Data  # Newdata
  General Without
1     119      86
predict(Mod.GW, newdata=Pred.Data, interval="confidence", level=.95, 
  se.fit=FALSE)
      fit      lwr      upr
1 93.3029 91.76003 94.84578
predict(Mod.GW, newdata=Pred.Data, interval="prediction", level=.95, 
  se.fit=FALSE)
      fit      lwr      upr
1 93.3029 87.77848 98.82733
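Confidence versus prediction intervals: the former bounds the average productivity of applicants with these scores, the latter bounds a single new applicant and so also carries the residual noise. A sketch on mtcars, a stand-in for Mod.GW:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in for Mod.GW
nd  <- data.frame(wt = 3, hp = 150)
ci <- predict(fit, newdata = nd, interval = "confidence", level = 0.95)
pi <- predict(fit, newdata = nd, interval = "prediction", level = 0.95)
# the prediction interval is always wider: it adds the variance of a new case
c(ci_width = ci[, "upr"] - ci[, "lwr"], pi_width = pi[, "upr"] - pi[, "lwr"])
```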

Black Boxes

F has an accompanying visual

It is the ratio of the areas of two boxes, each normalized by its degrees of freedom.

It Also Has a Marginal Interpretation

Mod.WithWithout <- lm(With~Without , data=GamingDevices)
anova(Mod.WithWithout, lm(With~Without+General, data=GamingDevices))
Analysis of Variance Table

Model 1: With ~ Without
Model 2: With ~ Without + General
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     34 415.21                                  
2     33 224.34  1    190.87 28.078 7.638e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Mod.WithWithout, lm(With~Without+Dexterity, data=GamingDevices))
Analysis of Variance Table

Model 1: With ~ Without
Model 2: With ~ Without + Dexterity
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     34 415.21                                  
2     33 237.46  1    177.75 24.702 2.014e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both Add Something at the Margin

And there is still something to be learned at the margins.

anova(lm(With~Without+General, data=GamingDevices),lm(With~Without+General+Dexterity, data=GamingDevices))
Analysis of Variance Table

Model 1: With ~ Without + General
Model 2: With ~ Without + General + Dexterity
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     33 224.34                              
2     32 191.46  1    32.875 5.4946 0.02545 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
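When exactly one variable is added, the partial F above is the square of that variable's t statistic; the nested-model comparison and the coefficient's t test are the same test. A sketch on mtcars:

```r
m1 <- lm(mpg ~ wt, data = mtcars)        # restricted model
m2 <- lm(mpg ~ wt + hp, data = mtcars)   # adds one regressor
partial_F <- anova(m1, m2)$F[2]
t_hp <- summary(m2)$coefficients["hp", "t value"]
all.equal(partial_F, t_hp^2)             # TRUE: a one-variable addition gives F = t^2
```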

Visualizing Regression

library(visreg)
visreg2d(Mod.GW, "General", "Without")

A single plot for HR to deploy.

About the Purported Outlier?

What should we do? We could always reconstruct the analysis. It is surprisingly easy; think about it for a second.

  • All we need to do is remove the outlier and recompile the document. Reproducible analysis saves time.
  • The state files on posit.cloud and in the course directory do exactly this, with a bit more detail.

A Summary [in One Graphic]

A 3D Graphic